A Lexical Database of Portuguese Multiword Expressions

نویسندگان

  • Sandra Antunes
  • Maria Fernanda Bacelar do Nascimento
  • João Miguel Casteleiro
  • Amália Mendes
  • Luísa Pereira
  • Tiago Sá
چکیده

This presentation focuses on an ongoing project which aims at the creation of a large lexical database of Portuguese multiword (MW) units, automatically extracted through the analysis of a balanced 50 million word corpus, statistically interpreted with lexical association measures and validated by hand. This database covers different types of MW units, like named entities, and lexical associations ranging from sets of favoured co-occurring forms to strongly lexicalized expressions. This new resource has a two-fold objective: to be an important research tool which supports the development of MW units typologies; to be of major help in developing and evaluating language processing tools able of dealing with MW expressions.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

COMBINA-PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions

This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese Multiword (MW) expressions. The objective of this on-going project is to compile a large lexical database of multiword (MW) units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, interpreted with lexical association measures and manually validated. MW expressions conside...

متن کامل

Multilingual Aspects of Multiword Lexical Units

As most of the machine-readable dictionaries contain clearly insufficient information about multiword lexical units, there is a constant need to extend and tune specialized lexical databases to account for new expressions. In this paper, we present a system exclusively based on statistics that massively extracts from unrestricted text corpora contiguous and noncontiguous rigid multiword lexical...

متن کامل

Improving Lexical Databases with Collocational Information: Data from Portuguese

This article focuses on ongoing work done for Portuguese concerning the phenomenon of lexical co-occurrence known as collocation (cf. Cruse, 1986, inter al.). Instances of the syntactic variety formed by noun plus adjective have been especially observed. Collocational instances are not lexical entries, and thus should not be stored in the lexicon as multiword lexical units. Their processing can...

متن کامل

Multiword Lexical Acquisition And Dictionary Formalization

In this paper, we present the current state of development of a large-scale lexicon built at LabEL1 for Portuguese. We will concentrate on multiword expressions (MWE), particularly on multiword nouns, (i) illustrating their most relevant morphological features, and (ii) pointing out the methods and techniques adopted to generate the inflected forms from lemmas. Moreover, we describe a corpus-ba...

متن کامل

Representation And Treatment Of Multiword Expressions In Basque

This paper describes the representation of Basque Multiword Lexical Units and the automatic processing of Multiword Expressions. After discussing and stating which kind of multiword expressions we consider to be processed at the current stage of the work, we present the representation schema of the corresponding lexical units in a generalpurpose lexical database. Due to its expressive power, th...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006